
v0.2.5: split BPF datapath into fast_path + finalize via bpf_tail_call (#45)

Merged

lunarthegrey merged 4 commits into main from v0.2.5-tail-call-finalize on May 5, 2026
Conversation

@lunarthegrey
Contributor

Summary

Fixes the v0.2.4 stack-budget regression on UniFi 5.15 kernels by splitting the BPF datapath into two programs connected by bpf_tail_call, so each program gets its own 512-byte stack budget. The same architecture establishes the pattern for future fast-path-internal stages (additional packet transforms, more sophisticated FIB logic) without re-bisecting stack bytes every release.

The proximate failure (reported on edge1-mci1-net): UniFi's 5.15.72-ui-cn9670 aarch64 kernel rejected v0.2.4's fast_path BPF program with "combined stack size of 3 calls is 544. Too large". The same bytecode loaded cleanly on CI's qemu 5.15 vanilla x86_64 (stack depth 0+360+0+0); UniFi's BPF patches plus the aarch64 JIT account for stack usage ~120 bytes higher than vanilla.

Architecture

                       packet ingress
                              │
                              ▼
   ┌──────────────────────────────────────────────────┐
   │ fast_path  (XDP, attached to eth0..ethN)         │  Frame A
   │   classification (allow-prefix, block-prefix)     │  fits 512B
   │   FIB lookup  (kernel-fib | custom-fib | compare)│
   │   devmap pre-check                                │
   │   TTL decrement / L2 rewrite (in-place)           │
   │   write per-CPU MUTATION_CTX                      │
   │   bpf_tail_call(MUTATION_PROGS, 0)  ──────────┐  │
   └────────────────────────────────────────────────│──┘
                                                    │
                                                    ▼
   ┌──────────────────────────────────────────────────┐
   │ finalize  (XDP, tail-called by fast_path)        │  Frame B
   │   read MUTATION_CTX                               │  fresh 512B
   │   mss-clamp lookup + (optional) MSS rewrite       │
   │   VLAN choreography (push / pop / rewrite)        │
   │   bpf_redirect_map(egress_ifindex)                │
   └──────────────────────────────────────────────────┘

mss-clamp + VLAN + redirect move from forward_success into finalize. Per-prefix LPM keys + TCP-options walk live in finalize's fresh stack budget. fast_path's responsibilities shrink to classification + L2/TTL, which fits comfortably under any kernel's accounting.

This is not the multi-module dispatcher (SPEC §3.4 / §5.0); that's for chaining independent modules at the same hook (ddos in front of fast-path, sampler behind it). Tail-call is for splitting one logical pipeline. Both will eventually exist; v0.2.5 ships only the tail-call split.

What's in the PR

  • New BPF program (crates/modules/fast-path/bpf/src/finalize.rs): #[xdp] pub fn finalize reads MUTATION_CTX, then runs mss-clamp + VLAN + redirect. ~280 LOC; mss-clamp + VLAN choreography moved here verbatim from main.rs.
  • BPF maps (maps.rs): new MutationCtx struct (16 bytes), MUTATION_CTX (PerCpuArray, single-element scratch), MUTATION_PROGS (ProgramArray, 8 slots). New StatIdx 35/36 (err_tail_call, err_mutation_ctx).
  • BPF program (main.rs): forward_success writes MutationCtx and tail-calls into MUTATION_PROGS[0] instead of doing mss-clamp + VLAN + redirect inline. ~440 LOC of mss-clamp + VLAN choreography moved out (now in finalize.rs).
  • Userspace lifecycle (linux_impl.rs): attach() loads finalize first → populates MUTATION_PROGS[0] with finalize's FD → loads + attaches fast_path. Order matters. New populate_mutation_progs helper; new tail_call_chain_from_pin for status reporting.
  • Pin lifecycle (pin.rs): new FINALIZE_PROGRAM_NAME constant + PROGRAM_NAMES array. MAP_NAMES grows to 19 (adds MSS_CLAMP_V4/V6/BY_IFACE, which were missing from v0.2.4, plus MUTATION_CTX/PROGS). pin_program_and_maps walks both program names.
  • CLI status (loader.rs): new "tail-call chain" section reports MUTATION_PROGS[0] occupancy. Stat names array updated for indices 33-36 (mss-clamp + tail-call diagnostics), bringing it in sync with what's actually defined.
  • Test harness (tests/common/mod.rs): Harness::new now loads both programs and populates MUTATION_PROGS[0] before returning. bpf_prog_test_run follows tail-calls (the kernel re-enters its dispatcher for the target program), so existing tests transparently see the full chain's verdict + mutations.
  • Docs (README.md, new docs/runbooks/tail-call-architecture.md): Status table row for v0.2.5+; the new runbook covers chain topology, the MutationCtx wire format, debug commands (bpftool prog show, bpftool map dump MUTATION_PROGS), and how future stages slot in.
  • Version (Cargo.toml, VERSION, README install snippets): 0.2.4 → 0.2.5.
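For reference, the field names carried across the tail-call boundary are given in the commit message (egress_ifindex, egress_vid, ingress_vid, ip_offset, is_v4) and the PR states the struct is 16 bytes; the exact types, ordering, and padding below are assumptions in this sketch, not the committed layout:

```rust
// Hypothetical MutationCtx layout: field names and the 16-byte size come from
// the PR; types, order, and explicit padding are guesses for illustration.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct MutationCtx {
    pub egress_ifindex: u32, // devmap key used by finalize's bpf_redirect_map
    pub ip_offset: u32,      // offset of the IP header from packet start (14 or 18)
    pub egress_vid: u16,     // VLAN to push/rewrite on egress (0 = none)
    pub ingress_vid: u16,    // VLAN seen on ingress, for pop/rewrite decisions
    pub is_v4: u8,           // 1 = IPv4, 0 = IPv6
    pub _pad: [u8; 3],       // explicit padding to reach the stated 16 bytes
}
```

Because the struct crosses a per-CPU map between two BPF programs, #[repr(C)] with explicit padding keeps the wire format stable regardless of compiler field reordering.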

What's deliberately NOT in this PR

  • Netns end-to-end integration test (tests/tail_call.rs was in the plan). The kernel BPF_PROG_TEST_RUN harness already exercises the tail-call via the existing fixtures (now updated to populate MUTATION_PROGS[0]). A real-veth + AF_PACKET capture test would be good additional coverage, but it's ~150 LOC of test infra and was deferred to keep this PR focused on the architecture change. Will land as a follow-up.
  • bpf_fib_lookup per-CPU map move (mentioned as "alternative B" in earlier discussion). Tail-call obviates it; we have plenty of stack headroom now.
  • Multi-stage chain (slots 1-7 of MUTATION_PROGS). Reserved capacity is in place for future use; no actual stages today.
  • The dispatcher (SPEC §5.0). Different problem entirely — for ddos / sampler / randomizer composition.

CI expectations

Existing CI matrix should pass:

  • ✅ fmt + clippy + test (workspace lib tests pass on macOS dev: 94 + 40)
  • ✅ Cross-build matrix (4 targets) — userspace-only, no BPF concern
  • ✅ qemu-verifier 5.15 + 6.6 — should load both programs, attach succeeds, sudo-gated test fixtures pass through the tail-call chain

The qemu jobs are the meaningful test for "does the verifier accept this on real kernels." If they pass, vanilla 5.15 / 6.6 are fine. UniFi-style stricter accounting will be confirmed via post-merge deployment on the same router that hit the v0.2.4 regression.

Test plan

Pre-merge (CI):

  • cargo fmt --all --check: clean
  • cargo clippy --workspace --all-targets --all-features -- -D warnings: clean
  • cargo test --workspace --lib: 94 + 40 tests pass on macOS dev host
  • CI fmt+clippy+test passes
  • CI cross-build (4 targets) passes
  • CI qemu-verifier 5.15 + 6.6 passes (with updated harness following the chain)

Post-merge on the deployed UniFi router:

  • apt install ./packetframe_0.2.5_arm64.deb. sudo systemctl restart packetframe.
  • sudo packetframe feasibility --config /etc/packetframe/packetframe.conf --human — all xdp.attach.ethN now PASS (no more 544/512 rejection).
  • sudo packetframe status — the new "tail-call chain" section reports MUTATION_PROGS[0]: populated (finalize).
  • Add a per-prefix mss-clamp directive: mss-clamp 23.191.200.0/24 1360. sudo packetframe reconfigure.
  • tcpdump -i eth2 -n 'tcp[tcpflags] & tcp-syn != 0' -vv confirms wire MSS=1360 on outbound SYNs.
  • sudo packetframe status | grep mss_clamp_applied shows the counter climbing.
  • err_tail_call and err_mutation_ctx stay at 0.

Tag flow after merge

git checkout main && git pull
git tag -a v0.2.5 -m "v0.2.5"
git push origin v0.2.5

(The version bump is in this PR — same pattern as v0.2.4 — so just tag and push.)

🤖 Generated with Claude Code

lunarthegrey and others added 4 commits May 4, 2026 22:38
Fixes the v0.2.4 regression on UniFi 5.15.72-ui-cn9670 (aarch64) where
the kernel rejected fast_path with "combined stack size of 3 calls is
544. Too large" — same bytecode that loaded cleanly on CI's qemu 5.15
(stack 0+360+0+0). UniFi's BPF patches plus aarch64 JIT account stack
~120 bytes higher than vanilla 5.15 on x86_64.

Architecture: two XDP programs in one ELF, chained by bpf_tail_call.
Each gets its own 512-byte stack budget.

  fast_path (XDP, attached per-iface):
    classification (allow-prefix, block-prefix, dry-run)
    FIB lookup (kernel-fib | custom-fib | compare)
    devmap pre-check
    TTL decrement (in-place)
    L2 rewrite (in-place)
    write per-CPU MUTATION_CTX
    bpf_tail_call(MUTATION_PROGS, 0) ────────► finalize (XDP, tail-called):
                                                 read MUTATION_CTX
                                                 mss-clamp lookup + mutation
                                                 VLAN choreography
                                                 bpf_redirect_map

mss-clamp + VLAN + redirect move from forward_success into the new
finalize program; per-prefix LPM keys + TCP-options walk live in
finalize's fresh stack budget. fast_path's responsibilities shrink to
classification + L2/TTL mutation, which fits comfortably under any
kernel's accounting.

This is NOT the multi-module dispatcher (SPEC §3.4 / §5.0). Tail-call
is one-way control transfer between cooperating stages of one logical
pipeline; the dispatcher is for chaining independent modules at the
same hook (ddos, sampler). Both will eventually exist; v0.2.5 ships
only the former.

New BPF maps:
* MUTATION_CTX (PerCpuArray<MutationCtx>): per-CPU scratch carrying
  egress_ifindex, egress_vid, ingress_vid, ip_offset, is_v4 across the
  tail-call boundary. fast_path writes, finalize reads.
* MUTATION_PROGS (ProgramArray, 8 slots): jump table. Slot 0 holds
  finalize today; slots 1-7 reserved for future stages.

New StatIdx counters (append-only):
* 35: err_tail_call — fast_path's tail_call returned an error (slot
  empty). fast_path falls through to XDP_PASS so traffic still flows.
* 36: err_mutation_ctx — finalize couldn't read MUTATION_CTX. Should
  be 0 in steady state.
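
The two counter names and indices above are from the PR; the enum shape below is a hypothetical sketch of what an append-only StatIdx might look like (earlier variants elided, repr assumed):

```rust
// Sketch only: indices 35/36 and their names are from the PR; the repr and
// the elided earlier variants are assumptions.
#[repr(u32)]
pub enum StatIdx {
    // ... indices 0-34 elided ...
    ErrTailCall = 35,    // tail_call failed (empty slot); fast_path falls back to XDP_PASS
    ErrMutationCtx = 36, // finalize could not read MUTATION_CTX; should stay 0
}
```

Explicit discriminants keep the stat array append-only: userspace and BPF agree on indices even if variants are later inserted in source order.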

Userspace lifecycle changes:
* attach() loads finalize first, populates MUTATION_PROGS[0], then
  loads + attaches fast_path. Order matters: fast_path's first packet
  must find a populated slot.
* pin_program_and_maps walks PROGRAM_NAMES (fast_path + finalize); both
  pins survive SIGTERM per SPEC §8.5.
* MAP_NAMES grows to include MUTATION_CTX, MUTATION_PROGS, and the
  v0.2.4 mss-clamp maps that were missing from the previous list.
* Status command reports tail-call chain occupancy ("MUTATION_PROGS[0]:
  populated (finalize)") so operators can confirm wiring.

Test harness:
* Harness::new() now loads both programs and populates MUTATION_PROGS[0]
  before returning, so existing bpf_prog_test_run-based tests follow
  the chain transparently. Kernel's BPF_PROG_TEST_RUN handles
  bpf_tail_call by re-entering its dispatcher for the target program;
  tests see the verdict + mutations from the full chain.

Version bumped 0.2.4 → 0.2.5. README Status table grows a "Two-stage
BPF datapath" row. New runbook at docs/runbooks/tail-call-architecture.md
documents the chain, MutationCtx wire format, debug commands, and
how future stages slot in.

Netns end-to-end integration test (real veth + SYN + capture, asserts
MSS clamped on the wire) is deferred to a follow-up PR. Existing
attach-roundtrip + bpf_prog_test_run fixtures in qemu-verifier validate
that both programs LOAD + attach + the tail-call wires correctly on
kernels 5.15 + 6.6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The newer-kernel verifier on the GitHub Actions runner rejected the
single-entry mss_clamp_inline at the proto-byte read with
`R9 offset is outside of the packet`. The bound check used a
runtime-conditional size (`if is_v4 { 20 } else { 40 }`), which the
verifier could not connect to the subsequent typed cast through
`*const Ipv4Hdr` — so the read at offset 9 (proto field) appeared
unbounded.

Splitting the dispatch upfront lets each path bound-check with a
compile-time constant (`Ipv4Hdr::LEN` / `Ipv6Hdr::LEN`) immediately
followed by the cast and field reads — the same `ptr_at` pattern
main.rs already uses. The qemu kernels (5.15, 6.6) accepted the old
form; the newer runner kernel did not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
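The fix described above can be modeled in plain Rust: bound-check with a compile-time constant immediately before the typed access, one branch per IP version, instead of a single runtime-conditional bound. This is a userspace sketch of the pattern, not the actual BPF code; function names and the header stubs are illustrative:

```rust
// Per-version header lengths as compile-time constants, mirroring
// Ipv4Hdr::LEN / Ipv6Hdr::LEN in the BPF code.
const IPV4_HDR_LEN: usize = 20;
const IPV6_HDR_LEN: usize = 40;

// ptr_at-style helper: constant-size bound check right before the access,
// so the (modeled) verifier can tie the check to the subsequent read.
fn ptr_at(data: &[u8], offset: usize, len: usize) -> Option<&[u8]> {
    if offset.checked_add(len)? > data.len() {
        return None; // out of packet: bail instead of reading
    }
    Some(&data[offset..offset + len])
}

// Dispatch split upfront: each arm uses its own constant bound, then reads
// the proto byte (offset 9 in IPv4) or next-header (offset 6 in IPv6).
fn l4_proto(data: &[u8], ip_offset: usize, is_v4: bool) -> Option<u8> {
    if is_v4 {
        let hdr = ptr_at(data, ip_offset, IPV4_HDR_LEN)?;
        Some(hdr[9])
    } else {
        let hdr = ptr_at(data, ip_offset, IPV6_HDR_LEN)?;
        Some(hdr[6])
    }
}
```

The rejected form did `let len = if is_v4 { 20 } else { 40 };` and checked once with the runtime value; splitting the branches gives each read a constant the verifier can propagate.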
The verifier's `find_good_pkt_pointers` refuses to propagate readable-
range info through packet-pointer arithmetic when the scalar offset's
umax_value exceeds MAX_PACKET_OFF (0xffff). `mctx.ip_offset` is read
from a per-CPU map, so the verifier sees its full u32 range and skips
range propagation — leaving the post-bound-check pkt pointer with
range=0 and rejecting the subsequent header field read.

Capping `ip_offset` at MAX_IP_OFFSET (64) right after the MUTATION_CTX
read gives the verifier a tight umax it can reason about. fast_path
writes 14 or 18 in practice; 64 leaves headroom for a future second
VLAN tag. Out-of-range is fail-safe XDP_PASS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
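The cap can be modeled as a tiny pure function: clamp the map-provided offset to a tight bound right after the MUTATION_CTX read, so the verifier sees a small umax. The constant and the 14/18 values are from the commit message; the helper name is illustrative:

```rust
// fast_path writes 14 (plain Ethernet) or 18 (one VLAN tag) in practice;
// 64 leaves headroom for a future second VLAN tag.
const MAX_IP_OFFSET: u32 = 64;

// Validate the offset read from the per-CPU map. After this check the
// (modeled) verifier knows umax_value = 64, small enough for packet-pointer
// range propagation. None maps to the fail-safe XDP_PASS in the real program.
fn checked_ip_offset(raw: u32) -> Option<u32> {
    if raw > MAX_IP_OFFSET { None } else { Some(raw) }
}
```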
Capping ip_offset at 64 (previous commit) got the verifier past the IP
header reads but the TCP csum patch still hit "R6 offset is outside of
the packet" at byte 17 of the TCP header. The bound check on
`start + csum_off + 2 > end` did not propagate readable-range back to
the actual read site because LLVM emitted a fresh packet-pointer
arithmetic chain (new id) for the read.

v0.2.4's working pattern derived `ip_offset = (ip as usize) - start`
inside mss_clamp_inline, where the verifier tracks the result as a
`pkt - pkt` subtraction with `umax = MAX_PACKET_OFF (0xffff)` — a
pkt-derived bound that range propagation honors. Pulling the same
pattern into finalize: pass the typed `ip` pointer (already
bounds-checked) into `mss_clamp_tcp` and recover ip_offset there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
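The pkt - pkt derivation described above can be sketched in userspace Rust: rather than threading a map-derived scalar offset to the read site, pass the already-bounds-checked header slice and recover the offset from pointer subtraction, which the verifier tracks as a pkt-derived bound. Names here are illustrative, not the actual finalize.rs code:

```rust
// Recover ip_offset inside the helper from two pointers into the same
// packet buffer: a pkt - pkt subtraction the verifier's range propagation
// honors (umax = MAX_PACKET_OFF), unlike a raw value read from a map.
fn recover_ip_offset(data: &[u8], ip_hdr: &[u8]) -> usize {
    (ip_hdr.as_ptr() as usize) - (data.as_ptr() as usize)
}
```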
lunarthegrey merged commit 04bf598 into main May 5, 2026 (10 checks passed)
lunarthegrey deleted the v0.2.5-tail-call-finalize branch May 5, 2026 04:49
lunarthegrey mentioned this pull request May 5, 2026 (4 tasks)